This jupyter notebook tutorial provides a quick introduction to the basic functionalities of the quasildr package.
To run this notebook, you need some extra python dependencies for single-cell data preprocessing and plotting. If you have not installed them before, you can install with pip install scanpy plotly.
import pandas as pd
import scanpy as sc
import numpy as np
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from quasildr.graphdr import *
We will load a dataset containing 18,213 cells from developing mouse dentate gyrus from Hochgerner et al. 2018.
dataanno = pd.read_csv('../example/Dentate_Gyrus.anno.gz',sep='\t',index_col=0)
dataanno = dataanno.loc[:,'ClusterName']
data = pd.read_csv('../example/Dentate_Gyrus.spliced_data.gz',sep='\t',index_col=0)
adata = sc.AnnData(data.values.T, data.columns.values,data.index.values)
adata.var_names_make_unique()
sc.pp.log1p(adata)
sc.pp.scale(adata)
sc.tl.pca(adata)
We will first visualize the first 3 PCs. For this dataset, the different cell types are mixed in the PCA visualization.
pca50 = adata.obsm['X_pca']
pca50 = pca50/pca50[:,0].std()
traces = [go.Scatter3d(
x = np.asarray(pca50[dataanno.values==ct,0]).flatten(),
y = np.asarray(pca50[dataanno.values==ct,1]).flatten(),
z = np.asarray(pca50[dataanno.values==ct,2]).flatten(),
marker=dict(
size=1,
opacity=1,
),
mode='markers',
name=ct
) for ct in np.unique(dataanno)]
iplot(traces)
We will provide PCA transformed data as input to GraphDR. Applying a linear dimensionality reduction method like PCA before applying GraphDR is generally recommended.
For demonstration purpose, we first used no_rotation=True which makes sure the output is in the same linear subspace as the PCAs with no rotation, enabling easy comparison. By default, graphdr function preserve the dimensionality of its input, but you can also specify n_components to reducce dimensionality. (Tip: For very big dataset, specifying n_components and no_rotation=True can significantly speed up computation).
import warnings
warnings.filterwarnings("ignore")
Z = graphdr(pca50,regularization=500,refine_iter=0,no_rotation=True,rescale=True)
#`recale=True` makes sure that the GraphDR output is scaled to have the same mean and variance as input, facilitating comparison.
traces = [go.Scatter3d(
x = np.asarray(Z[dataanno.values==ct,0]).flatten(),
y = np.asarray(Z[dataanno.values==ct,1]).flatten(),
z = np.asarray(Z[dataanno.values==ct,2]).flatten(),
marker=dict(
size=0.5,
opacity=1,
),
mode='markers',
name=ct
) for ct in np.unique(dataanno)]
iplot(traces)
With GraphDR reprsentation, the cell types are now much better separated than PCA, while the overal interpretation of the space is preserved.
In this case there are a small number of cells floating in between branches. This is because nearest neigbor graph can contain spurious edges connecting cells that belong to different cell types. This usually only happens for a small fraction of cells, but we can fix this by performing the optional extra steps of automatically prune spurious graph edges, as shown below:
Z = graphdr(pca50,regularization=500,refine_threshold=12,refine_iter=4,no_rotation=True,rescale=True)
traces = [go.Scatter3d(
x = np.asarray(Z[dataanno.values==ct,0]).flatten(),
y = np.asarray(Z[dataanno.values==ct,1]).flatten(),
z = np.asarray(Z[dataanno.values==ct,2]).flatten(),
marker=dict(
size=0.5,
opacity=1,
),
mode='markers',
name=ct
) for ct in np.unique(dataanno)]
iplot(traces)
We can also let GraphDR find the optimal rotation / subspace:
Z = graphdr(pca50,regularization=500,refine_threshold=12,refine_iter=4,no_rotation=False,rescale=True)
traces = [go.Scatter3d(
x = np.asarray(Z[dataanno.values==ct,0]).flatten(),
y = np.asarray(Z[dataanno.values==ct,1]).flatten(),
z = np.asarray(Z[dataanno.values==ct,2]).flatten(),
marker=dict(
size=0.5,
opacity=1,
),
mode='markers',
name=ct
) for ct in np.unique(dataanno)]
iplot(traces)
Last note: GraphDR can also use GPU to accelerate the computation even more. Assuming you have a CUDA-enabled GPU with the cupy package installed, just use use_cuda=True!
We now apply StructDR to extract developmental trajectory structure from GraphDR representation.
StructDR unifies cluster, trajectory, and surface extraction under the nonparametric density ridge estimation framework. Clusters, trajectories, or surfaces can be identified as zero-, one-, or two-dimensional density ridges, which are well defined given a probability density function. Specifically, StructDR first performs a kernel density estimation step to estimate the probability density function of cells and then extract the density ridges of the estimated density function.
from quasildr.structdr import *
Z = graphdr(pca50,regularization=1000/1.8213,refine_threshold=12,refine_iter=3,no_rotation=True,rescale=False)
Z = Z / Z[:,0].std()
s = Scms(Z[:,:15], bw=0.1, min_radius = 10)
#`bw` controls how much smoothing is applied. `min_radius` here set the minimum bw of a cell to 10th nearest neighbor distance.
#For very large data set with >>10,000 cells. We recommend using the multilevel representation to reduce the data with structdr.multilevel_compression before given to Scms.
Here we subsample representative cells to reduce computation time for projecting cells to density ridges. The density function is estimated with the full set of cells, therefore the density ridges remain unchanged and we just project a smaller number of representative cells to density ridges.
#This step can be skipped to map all cells to density ridges / trajectories if you have enough computing resource / time.
inds = s.inverse_density_sampling(Z[:,:15], n_samples=2000)
T1, ifilter = s.scms(X=Z[inds,:15],n_jobs = 4, stepsize=1, n_iterations=30, ridge_dimensionality=1, relaxation=0, batch_size=16)
#`ridge_dimensionality` specifies the ridge dimensionality, here we set it to 1 (1D density ridge). `n_iterations` controls how many iterations the algorithm is run for. `stepsize` specify multiplier to the standard SCMS step size. Set `batch_size` to a small number if memory is a constraint.
We visualize the extracted 1D density ridge which corresponds to developmental trajectory here.
traces = [go.Scatter3d(
x = np.asarray(T1[dataanno.values[inds]==ct,0]).flatten(),
y = np.asarray(T1[dataanno.values[inds]==ct,1]).flatten(),
z = np.asarray(T1[dataanno.values[inds]==ct,2]).flatten(),
marker=dict(
size=1,
opacity=1,
),
mode='markers',
name=ct
) for ct in np.unique(dataanno[inds])]
iplot(traces)
For applications that need a graph representation of trajectories, we can construct graph from the projected trajectory positions.
#extract structural backbones
from quasildr.utils import *
g_simple, g_mst, ridge_dims = extract_structural_backbone(T1, Z[inds,:15], s, max_angle=90, relaxation=0)
#It may produce NumbaPerformanceWarnings which can be safely ignored.
#Two graphs are constructed by this function. `g_simple` does not assume all cells to be connected in the graph and does not attempt to connect cells far apart. `g_simple` also adapts to the ridge dimensionality. `g_mst` will extract a minimum spanning tree that connects all cells.
We visualize the trajectory together with extracted graph representation here:
list_x = []
list_y = []
list_z = []
list_color = []
for i, j in zip(*g_mst.nonzero()):
xs = [T1[i,0], T1[j,0]]
ys = [T1[i,1], T1[j,1]]
zs = [T1[i,2], T1[j,2]]
list_x.extend(xs)
list_y.extend(ys)
list_z.extend(zs)
list_x.append(None)
list_y.append(None)
list_z.append(None)
traces = [go.Scatter3d(
x = np.asarray(T1[dataanno.values[inds]==ct,0]).flatten(),
y = np.asarray(T1[dataanno.values[inds]==ct,1]).flatten(),
z = np.asarray(T1[dataanno.values[inds]==ct,2]).flatten(),
marker=dict(
size=1,
opacity=1,
),
mode='markers',
name=ct
) for ct in np.unique(dataanno[inds])] + [go.Scatter3d(
x=list_x,
y=list_y,
z=list_z,
mode='lines',
line=dict(
color=list_color,
width= 2,
showscale=False,
),
name='MST',)]
iplot(traces)
The 1D density ridge / trajectory represents most parts of the dataset well. But it is not optimal to represent, for example, differentiating cells undergoing cell cycle here. To address this, we can allow StructDR to adaptively select between 1D/2D representation:
T, ifilter = s.scms(X=Z[inds,:15],n_jobs = 3, stepsize=1, n_iterations=10, ridge_dimensionality=1, relaxation=0.5, batch_size=16)
#`relaxation` is set to a value between 0-1. 0 gives fully 1D density ridge representation, while specifying a larger number will adaptively allow part of density ridges to be 2D depending on the local curvature of density function.
We visualize the density ridges extracted with adaptive dimensionality in the example below, where color indicates density ridge dimensionality:
ridge_dims = s.relax_bool + 1
traces = [go.Scatter3d(
x = np.asarray(T[:,0]).flatten(),
y = np.asarray(T[:,1]).flatten(),
z = np.asarray(T[:,2]).flatten(),
marker=dict(
size=0.8,
opacity=1,
color=ridge_dims,
),
mode='markers',
) ]
iplot(traces)
For more details of the inner work of StructDR and topics like statistical inference of confidence sets, check out the StructDR tutorial.